MGcount user guide
Getting started
What is MGcount
MGcount is a RNA-seq quantification tool conceived to incorporate ambiguous read alignments in a flexible way that is compatible with any biotype. This allows to extract more information from total RNA-seq datasets by simultaneous quantifying coding and non-coding transcripts, both small and long, where dealing with heterogeneous read-to-feature alignment ambiguities is key. When aligning reads to a reference genome, we distinguish between two types of ambiguous alignments:
- Multi-mappers: reads aligning to multiple genomic locations.
- Multi-overlaps: reads aligning to a genomic location with multiple annotated features.
To deal with the most frequent multi-overlaps (occuring when a smallRNA loci is embeded within a longRNA), MGcount hierarchically assigns reads to smallRNA and longRNA accounting for their transcript length disparity. Subsequently, MGcount models read-to-feature alignments in a graph. This is exploited to detect and report expression of sequence-similar annotated loci, where reads systhematically multi-map, as integrated features called communities.
System requirements
MGcount is built on the top of FeatureCounts, a well-known computational efficient software (Liao et. al., 2014). Please download it from the following link: http://subread.sourceforge.net/
Installation
MGcount is written in Python and is executed from the command line. You can either download the executable version (single binary file) or install it as a Python3 module.
Install and run as a Python3 module
You can install the package as a Python module:
pip3 install git+https://github.com/hitaandrea/MGcount.git
Once the package is installed, run the tool as a Python installed module:
python3 -m mgcount [args]
Download and run the executable program
Alternatively you can download the latest release as a single binary executable file here.
Save the program file to your Linux system and set the permissions to allow executing the file as program:
chmod +x mgcount
Once the file is executable, run the tool by calling the file from the command line with the desired arguments.
MGcount [args]
Workflow
MGcount workflow schema
During its execution, MGcount performs the following 3 steps:
1. Hierarchical assignation
First, a hierarchy is used to solve multi-overlapping ambiguities across biotypes during alignment-to-feature assignment based on transcript body length. Reads are assigned to genomic annotated features in three pre-defined sequential rounds: small, long_exon and long_intron. Hence, all reads with at least one alignment overlapping with an annotation in the current round are assigned to such feature and skipped in subsequent rounds.
2. Multi-loci communities recognition
To quantify multi-mapping reads, MGcount builds a directed weighted graph {G=(V,E)} where each vertex (V) is an annotated feature and a pair of directional edges (E) connect two features for which common multi-mapping reads exist. Edge weithgs are defined as the ratio of multi-mapping reads between the two vertex normalized by the total number of reads assigned to the source vertex. Vertex weights are defined as the log-transformed number of assigned alignments (Fig c). Resultant graphs structures capture the multiple-loci topologies of different RNA biotypes. Two graphs are sepparately build for smallRNA and longRNA, using the full pool of input alignments. Subsequently, highly-related features are grouped together by minimizing the map equation with the communities detection approach described by Rosvall et al., 2008 .
3. Expression matrix generation
MGcount generates an output expression matrix for each hierarchical assignation round that are appended together in a single output matrix. For each read, each alignment first gets an 1/N count, where N is the number of multi-mappers or residual multi-overlaps within biotypes that survive the hierarchical assignment. Next, counts for annotations which have been aggregated together in a community by the map equation are summed up. In this way, the systematic ambiguity in multi-mapping reads gets collapsed into a single MG community while the remaining signal is reported as fractionated counts.
Usage description
Inputs
Three inputs are required to run the program:
- Input alignment file: a .txt file listing the paths to the .bam alignment input files by line
- Annotations gtf file: a .gtf file containing a set of RNA feature annotations
- Output directory path: a string specyfing the path to the directory where MGcount outputs will be stored
Configurable arguments
Configurable arguments list
Optional arguments to configure a MGcount run include:
| Argument | Description | Default value |
|---|---|---|
| –paired_flag (-p) | Paired end flag. If null, the assignation occurs in single-end mode. | False |
| –strand_option (-s) | Library strandness. Options available are 0: unstranded, 1: forward-stranded and 2:reverse-stranded | 1 |
| –featureCounts_path | Path to featureCounts software executable file | /usr/bin/featureCounts |
| –btyperounds_filename | Optional .csv file with biotype to assignation round associations. It should be a two columns table where columnames are, in order, “biotype” and “counting_round”. | |
| –feat_small | GTF feature type entry for smallRNA reads assignation | transcript |
| –feat_output_small | GTF field name for which to summarize counts of longRNA assigned reads | transcript_name |
| –feat_biotype_small | GTF field name defining biotype for smallRNA features | transcript_biotype |
| –ml_flag_small | Multi-loci graph detection based groups flag for smallRNA features | 1 |
| –min_overlap_small | Minimal feature-alignment overlapping fraction for assigning a read to a smallRNA feature | 1 |
| –feat_output_long | GTF field name for which to summarize counts of longRNA assigned reads | gene_name |
| –feat_biotype_long | GTF field name defining biotype for longRNA features | gene_biotype |
| –min_overlap_long | Minimal feature-alignment overlapping fraction for assigning a read to a longRNA feature | 1 |
| –ml_flag_long | Multi-loci graph detection based groups flag for longRNA features | 1 |
| –th_low | Low minimal threshold of feature-to-feature multi-mapping fraction. | 0.01 |
| –th_high | High minimal threshold of feature-to-feature multi-mapping fraction. | 0.75 |
| –subs | Optional sub-sapling number of alignments to train the multigraph network. If 0, include all. | 0 |
| –n_cores (-T) | Number of cores for parallelisation by sample | 1 |
| –sample_id | SampleID input file names | None |
| –seed | Optional fixed seed for random numbers generation during communities detection |
Configurable arguments details
Sequencing data type
Two arguments need to be set following the type of input data. For a correct interpretation of RNA-seq data during assignation, the integer argument –strand_option need to be set according to the strandness of the library preparation method utilized (0:unstranded, 1: forward-stranded, 2:reverse-stranded). If dealing with paired reads, –paired_flag should be added to the command line call.
Multi-core mode
MGcount may process samples in parallel in all three steps of the workflow (hierarhical assignation, multigraph generation and count matrix building). The number of CPUs to be used by MGcount can be defined with the –n_cores option. We only recommend this if runing the tool in a server or a high performance computing seting due to high RAM consumption in parallel mode.
Assignation rounds configuration
| Round | feature | feature_output | feature_biotype | min_overlap | ml_flag |
|---|---|---|---|---|---|
| small | transcript | transcript_name | transcript_biotype | 1 | True |
| long_exon | exon | gene_name | gene_biotype | 1 | True |
| long_intron | gene | gene_name | gene_biotype | 1 | True |
At each round of the hierarchical assignation, MGcount extracts the set of annotations in the gtf with entry type –feature whose –feature_biotype attribute is included in the list of biotypes associated to the round (as defined by the csv file –btyperounds_filename). Subsequently, alignments are assigned to the restricted set of annotations whenever a minimum read fraction (as defined by –min_overlap) overlaps with an annotated feature of the extracted annotations subset.
If featureCounts software is not accessible on the system path (/usr/bin/), the full path to the software should be set through –featureCounts_path.
The association of diferent biotypes to either “long” or “small” assignation rounds can be customized in a csv file and parsed with –btypecrounds_filename argument to MGcount. The csv file must be a two columns table with names biotype and assignation_round.
By default, MGcount utilizes a csv file embeded with the program (or alternative installed with the Python module), that is located in the /mgcount/data subfolder of the Github repository. This table links the set of biotypes encountered in the 4 integrated gtf provided (Arabidopsis, Human, Mouse and Nematode) to the corresponding pre-defined small and long rounds.
For runing MGcount in further species or different annotations set, please make sure the biotypes you want to include in the quantification are correctly listed in this table for MGcount to recognize them.
At each round of the hierarchical assignation, alignment-feature assignation pairs are determined with FeatureCounts restricted to the designated subset of the .gtf annotated features. Each round can be configured by the user through the following five arguments:
Communities detection
To speed-up computation time, a fixed number of random sub-sampled alignments per sample can be set to build the graph through –subs.
The –seed argument may be set to guarantee exact solutions accross runs. The seed will be used as the starting point to generate random numbers during the communities detection approach.
MGcount ignores weak edges during map equation optimization based on a high threshold –th_high (by default 0.75) and a low threshold –th_low (by default 0.01) to prevent for over-fitting (splitting of large densely connected communities and merging of small loosely connected communities). Thresholds are empoyed as follows according to the type of graph:
LongRNA graph: All edges whose weight are below the high threshold are ignored for the longRNA graph. This avoids collapsing together certain features sharing only partial similarity. Given the long body length, multi-mappers may occur in only a specific part of the locus. In this situations, the threshold determines how large should the shared reads proportion between two features to be considered for a community. Lower thresholds will tend to aggregate low ambiguity in commuinties while high threholds will force features to remain single by spliting the multi-mapping reads as a fraction.
SmallRNA graph: The use of the high threshold or low threshold for weak edges filtering deppends on the edge weights distribution in the smallRNA graph. Here, for each biotype (microRNA, piRNA, snRNA, …), the threshold that is closer to the graph weights’ first quantile is employed. In this way, for biotypes were repeated loci are identical or nearly identical (f.i. microRNA), only high weights above the high threshold will be considered for communities while for biotypes with large groups of similar loci (snRNA, YRNA and pseudogenes, ….), all weights will be considered.
Outputs
At the end of its execution, MGcount provides the following outputs:
- Count matrix: Expression estimation matrix where each row corresponds to a transcript feature and each column corresponds to 1 input .bam file.
- Features metadata: It in, which match row names in the count matrix; the counting round when assigned; a flag designing whether a feature is a unique annotation or a MG community; and the feature biotype.
- Multi-mapping graph adjacency matrix: Adjacency matrix of links between annotated features allowing to generate the smallRNA or/and longRNA multi-mapping graph, encoded as sparse Matrix in MatrixMarket standard format.
- Multi-mapping graph communities (MGcommunities): Table linking each original feature in the gtf with the resultant feature matching count matrix and feature metadata feature identifiers. It includes both unique features (that remain unmodified) and aggregated features (that are collapsed following MG communities). In addition, the table provides with the total number of alignments per feature.
Tutorials
T1 - Prepare inputs and execute
In the following tutorials, we use two Human Brain samples as example to walk through MGcount execution. First of all, let’s create a folder and download the aligment bam files of the two samples. (The downloading process might take a few minutes).
mkdir mgcount_tutorial
cd mgcount_tutorial
wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/humanbrain_bamfiles.zip
unzip humanbrain_bamfiles.zip -d input_bamfiles
To run MGcount, we need to provide the software with a .txt file specifying the paths to the input aligment files. Here, these are the two .bam alignment files we just downloaded. We can generate this file from the command line:
printf "$PWD/%s\n" input_bamfiles/* > input_bamfilenames.txt
The other required input is a .gtf file with transcript features annotations. MGcount repository provides with four ready gtf files integrating annotations from several databases. Next, we will download them and use the Human annotations file.
wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/integrated_annotations_gtf.zip
unzip integrated_annotations_gtf.zip -d annotations_gtf
Once we have both the gtf and the input .bam files, we can simply run MGcount as a python3 module. A python module is run by calling python3 with the parameter -m and the name of the module. After this, we can specify MGcount arguments.
For this example, we parse the ready-integrateg gtf for the Human genome, the txt file we just created containing the path to the input aligment .bam files and a string designating the directory where MGcount outputs will be generated.
python3 -m mgcount --gtf annotations_gtf/Homo_sapiens.GRCh38.gtf --outdir outputs --bam_infiles input_bamfilenames.txt
If you have been able to run MGcount successfully, your output directory should contain 6 new files generate by MGcount including the RNA count matrix, the feature metadata table and two files containing the graph structure and the communities detected for longRNA and smallRNA respectively.
We are done!
T2 - Explore quantification outputs
In this tutorial, we will load the MGcount outputs we obtained in tutorial 1 and we will explore them from R. MGcount repository contains a few supporting scripts in R providing with side functionalities (manage annotations, visualize MGcount outputs…). It is possible to download the sole folder containing these scripts with the following shell command:
cd mgcount_tutorial
svn export https://github.com/hitaandrea/MGcount/trunk/R
Once the scripts are downloaded, we are ready to launch R. Please, start an R session and define the tutorial root directory as a variable.
Further, this tutorial uses the following R packages. Please, install them with install.packages() if you wish to run this tutorial on your system.
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
The main output of MGcount is the count_matrix.csv file containing the feature by sample expression matrix. We can import it into R as any regular csv.
counts <- read.csv(paste0(root_dir,'outputs/count_matrix.csv'), row.names = 1)
colnames(counts) <- sub('_Aligned.genome.dedup','',colnames(counts))Each row in the count table is a feature for which expression has been quantified and each column is associated to a sample. By interrogating for the matrix dimension, we see the matrix contains two columns corresponding to the two human brain libraries. The matrix is ready to be used in any RNA-seq downstream analysis.
## [1] 47539 2
We can look at the total number of reads assigned from each library by suming up each of the two rows. The mean counts over the two libraries is approximately 4.2 million.
## Human_Brain_total_100ng_1 Human_Brain_total_100ng_2
## 3837246 4578696
## [1] "Mean counts: 4207971.05"
Let’s import now the feature_metadata output. This table reports feature-related attributes such as the assignation round to which the annotation belongs, a flag stating wheter the feature is an aggregated community of annotations or an individual feature and the biotype associated to the feature. The feature identifiers in the counts_matrix and the feature_metadata match and therefore, this table can facilitate the extraction of particular features from the count matrix, e,g, a certain biotype, the exonic counts, the subset of features aggregated in communities, etc…)
Here, for example, we profit from the feature_metadata table to look at counts distribution by assignation round.
df <- cbind('counts' = rowSums(counts), feat_metadata)
ggplot(df, aes(x = assignation_round, y = counts)) + geom_violin(fill = 'grey') +
scale_y_continuous(trans = 'log', lim = c(10,100000), breaks = c(10,100,1000,10000,100000)) + theme_pubclean() +
geom_point(position = position_jitter(seed = 1, width = 0.4), size = 0.005, alpha = 0.2)Also, we can exploit the feature_metadata table to extract stats from the MGcount run. Here we look at the proportion of aggregated features (community_flag variable) by biotype (feature_biotype). The output shows how smallRNA biotypes tend to be more aggregated in communities because duplicated loci are more frequent.
|
|
Below, we display a few random rows from the table for illustration.
feat_subset <- with(feat_metadata, feat_metadata[c(
sample(which(assignation_round == 'small' & community_flag == "True"), 2),
sample(which(assignation_round == 'small' & community_flag == "False"), 2),
sample(which(assignation_round == 'long_exon' & community_flag == "True"), 1),
sample(which(assignation_round == 'long_exon' & community_flag == "False"), 1),
sample(which(assignation_round == 'long_intron' & community_flag == "True"), 1),
sample(which(assignation_round == 'long_intron' & community_flag == "False"), 1)),])
kable(feat_subset) %>% scroll_box(height = "300px")| feature | assignation_round | annotations_subset | feature_type | feature_output | feature_biotype | community_flag | |
|---|---|---|---|---|---|---|---|
| 1651 | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P | small | small | transcript | transcript_name | snRNA | True |
| 1842 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | small | small | transcript | transcript_name | snoRNA | True |
| 2140 | hsa-miR-137-3p | small | small | transcript | transcript_name | miRNA | False |
| 1016 | PIRNA54381 | small | small | transcript | transcript_name | piRNA | False |
| 15607 | HMGA1-exon | long_exon | long | exon | gene_name | protein_coding | True |
| 20339 | NUCKS1-exon | long_exon | long | exon | gene_name | protein_coding | False |
| 36520 | GATD3A_GATD3B_FP565260.2-intron | long_intron | long | gene | gene_name | protein_coding | True |
| 36140 | FCGRT-intron | long_intron | long | gene | gene_name | protein_coding | False |
Next we show how to generate a barplot showing the read distribution by biotype group. First lets load a table to group biotypes by category. We will use this to group less abundant biotypes in larger groups for visualization purposes.
bcats <- define_bcats()
feat_df <- merge(feat_metadata, bcats, by.x = 'feature_biotype', by.y = 'biotype', all.x = TRUE, all.y = FALSE)
## Add exon/intron distinction to biogroup based on counting round
feat_df$biogroup <- as.character(feat_df$biogroup)
feat_df$biogroup[feat_df$biogroup == 'Protein_coding'] <- 'Protein_coding_exon'
feat_df$biogroup[feat_df$biogroup == 'Long_non_coding'] <- 'Long_non_coding_exon'
feat_df$biogroup[feat_df$assignation_round == 'long_intron'] <-
sub('exon','intron',feat_df$biogroup[feat_df$assignation_round == 'long_intron'])
feat_df$biogroup <- as.factor(feat_df$biogroup)
feat_df <- feat_df[match(rownames(counts), feat_df$feature),]We then combine feature_metadata table with the count_matrix again and sum up the counts to get the total number of reads by biotype. We do this sepparately by biotype groups and small-non-coding individual biotypes to further represent the small non-coding spectrum.
## Generate biotype matrix
df <- data.frame(counts %>% group_by(feat_df$biogroup) %>% summarise_all(sum),
check.names = FALSE); names(df)[1] <- 'biotype'
biotype <- reshape(df, idvar = 'biotype', varying = list(names(df)[-1]),
v.names = 'counts', times = names(df)[-1],
timevar = 'sn', direction = 'long')
biotype$biotype <- as.character(biotype$biotype)
bgroups <- c("Hybrid","Long_pseudogenes","Protein_coding_intron","Long_non_coding_intron",
"Protein_coding_exon","Long_non_coding_exon" ,"Short_non_coding","tRNA","rRNA")
biotype$biotype[biotype$biotype %nin% bgroups] <- 'Hybrid'
biotype$biotype <- factor(biotype$biotype, levels = bgroups)
## Generate non-coding biotype table
df <- data.frame(counts[feat_df$biocat == 'sNC',] %>%
group_by(feat_df$feature_biotype[feat_df$biocat == 'sNC']) %>% summarise_all(sum),
check.names = FALSE); names(df)[1] <- 'biotype'
biotype_snc <- reshape(df, idvar = 'biotype', varying = list(names(df)[-1]),
v.names = 'counts', times = names(df)[-1],
timevar = 'sn', direction = 'long') Once we have extracted the counts by biotype group, let’s employ ggplot to visualize the read distribution profiles as barplots.
Reads distribution by biotype:
## ---- Abundance plot by biotype group
colP <- c('violetred1','slateblue1','darkgrey','lightgrey',
'springgreen4','springgreen3','violetred4','tan3','tan4')
names(colP) <- bgroups
p1 <- ggplot(biotype, aes(x = sn, y = counts, group = biotype, fill = biotype)) +
geom_bar(stat = 'identity', colour = 'black', width = 0.8) +
coord_flip() + xlab('') + ylab('Number of assigned reads') + theme_pubclean() +
guides(fill=guide_legend(nrow=4)) + scale_fill_manual(values=colP) +
theme(legend.position = 'top', legend.title = element_blank())
## ---- Relative abundance plot by small non-coding biotype
colP <- c('#e6be97','#848CFF','#4c2382', '#50eb76','tomato','#FDFF87','#FFAE51','darkorchid1','gold1','sienna3')
p2 <- ggplot(biotype_snc, aes(x = sn, y = counts, group = biotype, fill = biotype)) +
geom_bar(stat = 'identity', position = 'fill', colour = 'black', width = 0.8) + theme_pubclean() +
coord_flip() + xlab('') + ylab('Proportion of small non-coding assigned reads') +
guides(fill=guide_legend(nrow=4)) + scale_fill_manual(values = colP) +
theme(axis.text.y = element_blank(), legend.position = 'top', legend.title = element_blank())
## Display plots together
ggarrange(p1, p2, ncol = 2, widths = c(2,1))Finally, we import the multigraph_communities tables. These tables link each original feature in the gtf with the resultant feature matching the count matrix and feature metadata identifiers. It includes both unique features (that remain unmodified) and aggregated features (that are collapsed following MG communities). Thus, we can track back the features groupped in each aggregated feature.
Here we will use the feature_metadata subset from before to explore the original features in the gtf forming each new MG community feature.
csmall <- read.csv(paste0(root_dir,'outputs/multigraph_communities_small.csv'))
clong <- read.csv(paste0(root_dir,'outputs/multigraph_communities_long_exon.csv'))
kable(subset(csmall, feature %in% feat_subset$feature)) %>% scroll_box(height = "300px")| transcript_name | transcript_biotype | naln | naln_community | community_flag | community_id | community_name | community_biotype | feature | |
|---|---|---|---|---|---|---|---|---|---|
| 262 | RNU6-1310P | snRNA | 1 | 56 | True | snRNA-cl-32 | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P | snRNA | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P |
| 645 | RNU6-788P | snRNA | 1 | 56 | True | snRNA-cl-32 | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P | snRNA | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P |
| 900 | RNU6-838P | snRNA | 45 | 56 | True | snRNA-cl-32 | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P | snRNA | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P |
| 2082 | RNU6-1238P | snRNA | 1 | 56 | True | snRNA-cl-32 | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P | snRNA | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P |
| 2147 | RNU6-1123P | snRNA | 8 | 56 | True | snRNA-cl-32 | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P | snRNA | RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P |
| 2754 | SNORD115-1 | snoRNA | 2429 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2755 | SNORD115-2 | snoRNA | 600 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2756 | SNORD115-3 | snoRNA | 1566 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2757 | SNORD115-4 | snoRNA | 2465 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2758 | SNORD115-5 | snoRNA | 3984 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2759 | SNORD115-6 | snoRNA | 2701 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2760 | SNORD115-7 | snoRNA | 383 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2761 | SNORD115-8 | snoRNA | 1815 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2762 | SNORD115-9 | snoRNA | 5646 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2763 | SNORD115-10 | snoRNA | 4233 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2764 | SNORD115-11 | snoRNA | 5787 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2765 | SNORD115-12 | snoRNA | 5646 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2766 | SNORD115-13 | snoRNA | 2602 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2767 | SNORD115-14 | snoRNA | 2600 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2768 | SNORD115-15 | snoRNA | 4024 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2769 | SNORD115-16 | snoRNA | 3272 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2770 | SNORD115-17 | snoRNA | 4093 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2771 | SNORD115-18 | snoRNA | 4093 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2772 | SNORD115-19 | snoRNA | 4093 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2773 | SNORD115-20 | snoRNA | 3539 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2774 | SNORD115-21 | snoRNA | 2801 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2775 | SNORD115-22 | snoRNA | 3958 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2776 | SNORD115-23 | snoRNA | 1982 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2777 | SNORD115-24 | snoRNA | 973 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2778 | SNORD115-25 | snoRNA | 778 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2779 | SNORD115-26 | snoRNA | 3029 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2780 | SNORD115-27 | snoRNA | 43 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2781 | SNORD115-28 | snoRNA | 137 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2782 | SNORD115-29 | snoRNA | 5787 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2783 | SNORD115-30 | snoRNA | 1630 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2784 | SNORD115-31 | snoRNA | 315 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2785 | SNORD115-32 | snoRNA | 718 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2786 | SNORD115-33 | snoRNA | 1928 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2787 | SNORD115-34 | snoRNA | 3814 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2788 | SNORD115-35 | snoRNA | 399 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2789 | SNORD115-36 | snoRNA | 5787 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2790 | SNORD115-37 | snoRNA | 97 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2791 | SNORD115-38 | snoRNA | 2592 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2792 | SNORD115-39 | snoRNA | 3541 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2793 | SNORD115-40 | snoRNA | 3464 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2794 | SNORD115-41 | snoRNA | 3288 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2795 | SNORD115-42 | snoRNA | 3956 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2796 | SNORD115-43 | snoRNA | 5787 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2797 | SNORD115-44 | snoRNA | 2436 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 2799 | SNORD115-48 | snoRNA | 629 | 125440 | True | snoRNA-cl-214 | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc | snoRNA | SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc |
| 4552 | PIRNA54381 | piRNA | 36 | 36 | False | PIRNA54381 | |||
| 5508 | hsa-miR-137-3p | miRNA | 79 | 79 | False | hsa-miR-137-3p |
| gene_name | gene_biotype | naln | naln_community | community_flag | community_id | community_name | community_biotype | feature | |
|---|---|---|---|---|---|---|---|---|---|
| 2790 | NUCKS1 | protein_coding | 1714 | 1714 | False | NUCKS1 | |||
| 8352 | HMGA1P2 | processed_pseudogene | 4 | 78 | True | long-cl-6754 | HMGA1 | protein_coding | HMGA1 |
| 11087 | AL139095.4 | lncRNA | 1 | 78 | True | long-cl-6754 | HMGA1 | protein_coding | HMGA1 |
| 11653 | HMGA1 | protein_coding | 65 | 78 | True | long-cl-6754 | HMGA1 | protein_coding | HMGA1 |
| 14862 | HMGA1P1 | processed_pseudogene | 4 | 78 | True | long-cl-6754 | HMGA1 | protein_coding | HMGA1 |
| 23575 | HMGA1P3 | processed_pseudogene | 1 | 78 | True | long-cl-6754 | HMGA1 | protein_coding | HMGA1 |
| 24010 | HMGA1P6 | processed_pseudogene | 3 | 78 | True | long-cl-6754 | HMGA1 | protein_coding | HMGA1 |
| gene_name | gene_biotype | naln | naln_community | community_flag | community_id | community_name | community_biotype | feature | |
|---|---|---|---|---|---|---|---|---|---|
| 15793 | AL672032.1 | lncRNA | 1 | 250 | True | long-cl-11813 | GATD3A_GATD3B_FP565260.2 | protein_coding | GATD3A_GATD3B_FP565260.2 |
| 33419 | FCGRT | protein_coding | 53 | 53 | False | FCGRT | |||
| 34841 | GATD3B | protein_coding | 80 | 250 | True | long-cl-11813 | GATD3A_GATD3B_FP565260.2 | protein_coding | GATD3A_GATD3B_FP565260.2 |
| 34842 | FP565260.2 | protein_coding | 71 | 250 | True | long-cl-11813 | GATD3A_GATD3B_FP565260.2 | protein_coding | GATD3A_GATD3B_FP565260.2 |
| 35199 | GATD3A | protein_coding | 98 | 250 | True | long-cl-11813 | GATD3A_GATD3B_FP565260.2 | protein_coding | GATD3A_GATD3B_FP565260.2 |
We are done!
T3- Explore output multi-graph
Below, we use a few functions defined in “mg_visualize.R” to graphically explore the multi-mapping graph topology.
The function ‘mg_build’ takes the Multi-Graph adjacency matrix and the MG communities and creates an igraph object. The next two functions provide with the code to generate a few default plots given a Multi-Graph igraph object.
As an example, we will load the smallRNA Multi-Graph adjacency matrix and we will subset it to explore the sub-graph associated to microRNA features.
The adjacency matrix is stored as a sparse symetric matrix in MatrixMarket format, which can be imported in R by the readMM function from ‘Matrix’ R package. We also import the table containing all annotated features and MG communities outputs.
dir.create(paste0(root_dir,'mgplots'))
## Import ml data and adjacency matrix for small assignation round
inputM <- readMM(paste0(root_dir,'outputs/multigraph_matrix_small.mtx'))
multiloci_table <- read.csv(paste0(root_dir,'outputs/multigraph_communities_small.csv'))Next we subset the matrix by selecting the features under ‘miRNA category’. Each row in the Multi-Graph communities table table corresdponds to a feature annotation with non-zero assignments in the small round. The row index of each feature defines its column and row position in the Multi-Graph adjacency matrix.
## Extract microRNA matrix subset
btype <- 'miRNA'
idx <- which(multiloci_table$transcript_biotype == btype)
inM <- inputM[idx,idx]; ml <- multiloci_table[idx,]Necessary steps to convert the adjacency matrix and the communities table to an igraph object are in the mg_build function.
## Generate microRNA graph object
attr <- 'transcript_name'
g = mg_build(inM, ml, attr)
g <- delete_vertices(g, V(g)[V(g)$weight < 1])We can explore the patterns in the feature gene symbols established by the HGNC and link that to specific colors during visualization
## Extract microRNA mature extreme information from HUGO symol
V(g)$color_HUGO1 <- 'grey'
V(g)$color_HUGO1[grep('-3p',V(g)$feat)] <- 'orange'
V(g)$color_HUGO1[grep('-5p',V(g)$feat)] <- 'dodgerblue'
## Extract microRNA class from HUGO symbol
V(g)$color_HUGO2 <- 'grey'
idx <- grep('hsa-miR',as.character(V(g)$feat))
hugo <- as.factor(gsub('.*?([0-9]+).*', '\\1', V(g)$feat[idx]))
V(g)$color_HUGO2[idx] <- sample(grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)],
length(unique(hugo)))[hugo]The function mg_plotset generates a set of different plots of the Multi-Mapping graph and stores them as png files with a user-given file prefix (plotfile variable).
## Define plot colot
V(g)$customColor <- V(g)$color_HUGO1
## Standard plots
mg_plotset(g, plotfile= paste0(root_dir,'mgplots/mg_miRNA'))Raw visualization of the graph
## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA.png'))
plot(img)Visualization of the graph colored by -3/-5 patterns in microRNA transcript symbol
## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA_color.png'))
plot(img)Visualization of the graph colored by -3/-5 patterns in microRNA transcript symbol and detected communities
## Display plot
par(mar=c(0.01,0.01,0.01,0.01))
img <- magick::image_read(paste0(root_dir,'mgplots/mg_miRNA_cl_color.png'))
plot(img)We are done!
Integrating annotation sources
MGcount is a tool conceived to analyse heterogeneous datasets capturing diverse non-coding transcript types but the scope of features quantified by MGcount is bounded by the features annotated in the reference gtf file used as input. On this line, altough MGcount can be executed with any gtf annotations file (Ensembl, Gencode, etc…), we provide with the option to integrate annotations from several databases to take into account a more complete or/and up-to-date annotations set in the quantification, specially for small regulatory RNAs (piRNA, tRF, microRNA, siRNA). These transcripts are not necessarily annotated in general databases.
In tutorial 1, we used a ready integrated gtf for the human genome. Besides, the directory integrated_annotations_gtf we downloaded in tutorial 1, provides with integrated annotations gtf files for Arabidopsis, Mouse and Nematode that can be used for runing MGcount in the corresponding spcies. The following databases have been integrated for each specie:
- Arabidopsis thaliana: Ensembl, miRBase (microRNA) and RNACentral (siRNA);
- Homo Sapiens Ensembl, DASHR (piRNA and tRNA fragments [tRF]) and miRBase (microRNA);
- Mus musculus: Ensembl, miRBase (microRNA) and RNAcentral (piRNA);
- Caenorhabditis elegans: Ensembl and miRBase (microRNA).
For runing MGcount in any other genome, we encourage to follow the same procedure we followed to generate the last 4 gtfs. The script we used to generate the last 4 gtf is provided in the R folder from the MGcount github repository (that can be individually download with the command in tutorial 2). We hope the script can be used as a template for custom gtf integration in other species.